We have developed a simple, phenomenological, Monte-Carlo code that predicts the three-dimensional structure
of globular proteins from the DNA sequences that define them. We have applied this code to two small proteins,
the villin headpiece (1VII) and ColE1 rop (1ROP). Our code folds both proteins to within 5 Å rms of their native
structures.
1. PROTEINS

A protein is a linear chain of amino acids. The 
proteins of natural living organisms are composed 
of 20 different types of amino acids. A typical 
protein is a polymer of 300 amino acids, of which 
there are $20^{300} \approx 2 \times 10^{390}$ different possibilities.
The human body uses about 80,000 different proteins
for most of its functionality, including structure,
communication, transport, and catalysis.

The order of the amino acids in the proteins of 
an organism is specified by the order of the base 
pairs in the deoxyribonucleic acid, DNA, of its 
genome. Human DNA consists of $10^9$ base pairs
with a total length of 3 m. Since three base pairs
specify an amino acid, the code for the 80,000
human proteins requires only $3 \times 300 \times 80{,}000 \approx
7 \times 10^7$ base pairs, or 7% of the genome.
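
The two estimates in this section can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, and the genome size of $10^9$ base pairs is taken as the value implied by the 7% figure in the text.

```python
import math

AMINO_ACID_TYPES = 20
TYPICAL_LENGTH = 300        # residues in a typical protein
HUMAN_PROTEINS = 80_000
BASES_PER_AMINO_ACID = 3
GENOME_BASE_PAIRS = 10**9   # assumed genome size, as implied by the text

# Number of distinct sequences, 20^300, written as mantissa x 10^exponent.
log10_sequences = TYPICAL_LENGTH * math.log10(AMINO_ACID_TYPES)
exponent = math.floor(log10_sequences)
mantissa = 10 ** (log10_sequences - exponent)

# Base pairs needed to encode every protein, and their share of the genome.
coding_base_pairs = BASES_PER_AMINO_ACID * TYPICAL_LENGTH * HUMAN_PROTEINS
coding_fraction = coding_base_pairs / GENOME_BASE_PAIRS

print(f"20^300 is about {mantissa:.1f} x 10^{exponent}")
print(f"coding base pairs: {coding_base_pairs:.1e} ({coding_fraction:.0%} of the genome)")
```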

1.1. Amino Acids 

The twenty amino acids differ only in their side 
chains. The key atom in an amino acid is a carbon
atom called the α-carbon, $C_\alpha$. Four atoms
are attached to the $C_\alpha$ by single covalent bonds:
a hydrogen atom H, a carbonyl-carbon atom C,
a nitrogen atom N, and the first atom of the side
chain R of the amino acid. The carbonyl carbon C
is connected to an oxygen atom by two
covalent bonds and to a hydroxyl group OH by
another covalent bond; the nitrogen atom N is
attached to two hydrogen atoms, forming an amine
group NH2. The backbone of an amino acid is
the triplet N, $C_\alpha$, C.

Of the 20 amino acids found in biological systems,
19 are left handed. If one looks at the $C_\alpha$
from the H, then the order of the structures CO,
R, and N is clockwise (CORN). The one exception
is glycine, in which the entire side chain is a
single hydrogen atom; glycine is not chiral.

1.2. Globular Proteins 

There are three classes of proteins: fibrous, 
membrane, and globular. Fibrous proteins are 
the building materials of bodies; collagen is used 
in tendon and bone, α-keratin in hair and skin.
Membrane proteins sit in the membranes of cells 
through which they pass molecules and messages. 
Globular proteins catalyze chemical reactions; en- 
zymes are globular proteins. 

Under normal physiological conditions, saline 
water near pH=7 at 20-40 °C, proteins assume 
their native forms. Globular proteins fold into 
compact structures. The biological activity of 
a globular protein is largely determined by its 
unique shape, which in turn is determined by 
its primary structure, that is, by its sequence of 
amino acids. 

1.3. Kinds of Amino Acids 

The amino acids that occur in natural living 
organisms are of four kinds. Seven are nonpolar: 
alanine (ala), valine (val), phenylalanine (phe),
proline (pro), methionine (met), isoleucine (ile),
and leucine (leu). They avoid water and are said to 
be hydrophobic. Four are charged: aspartic acid 
(asp) and glutamic acid (glu) are negative, lysine 
(lys) and arginine (arg) are positive. Eight are 
polar: serine (ser), threonine (thr), tyrosine (tyr), 
histidine (his), cysteine (cys), asparagine (asn), 
glutamine (gln), and tryptophan (trp). The four
charged amino acids and the eight polar amino 
acids seek water and are said to be hydrophilic. 
Glycine falls into a class of its own. 
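
A simulation code needs this classification as a lookup table. The sketch below is one possible encoding of the four kinds listed above; the dictionary and function names are hypothetical, and the three-letter codes are the ones used in the text.

```python
# Classification of the 20 amino acids exactly as listed in the text.
RESIDUE_CLASS = {
    # nonpolar (hydrophobic)
    "ala": "nonpolar", "val": "nonpolar", "phe": "nonpolar", "pro": "nonpolar",
    "met": "nonpolar", "ile": "nonpolar", "leu": "nonpolar",
    # charged
    "asp": "negative", "glu": "negative",
    "lys": "positive", "arg": "positive",
    # polar
    "ser": "polar", "thr": "polar", "tyr": "polar", "his": "polar",
    "cys": "polar", "asn": "polar", "gln": "polar", "trp": "polar",
    # glycine is in a class of its own
    "gly": "glycine",
}

HYDROPHILIC_CLASSES = {"negative", "positive", "polar"}


def is_hydrophobic(code):
    """True for the seven nonpolar residues."""
    return RESIDUE_CLASS[code.lower()] == "nonpolar"


def is_hydrophilic(code):
    """True for the four charged and eight polar residues."""
    return RESIDUE_CLASS[code.lower()] in HYDROPHILIC_CLASSES
```

With this table, `is_hydrophobic("leu")` and `is_hydrophilic("lys")` are both true, while glycine belongs to neither group.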

1.4. Protein Geometry 

When two amino acids are joined to make a 
dipeptide, first the hydroxyl group OH attached 
to the carbonyl carbon C of the first amino acid 
combines with one of the two hydrogen atoms at- 
tached to the nitrogen N of the second amino acid 
to form a molecule of water H2O, and then a pep- 
tide bond forms between the carbonyl carbon C 
of the first amino acid and the nitrogen N of the 
second amino acid. A peptide bond is short, 1.33 Å,
and resists rotations because it is partly a double
bond.

To a good approximation, the six atoms $C_{\alpha 1}$,
$C'_1$, $O_1$, $N_2$, $H_2$, and $C_{\alpha 2}$ lie in a plane, called
the peptide plane. If a third amino acid is added
to the carbonyl carbon of the second amino
acid, then the six atoms $C_{\alpha 2} \ldots C_{\alpha 3}$ also will lie
in a (typically different) plane. Exceptionally, the 
peptide plane of proline is not quite flat because 
the side chain loops around, and its third carbon 
atom forms a bond with the nitrogen atom of the 
proline backbone. 

1.5. The Protein Backbone 

The protein backbone consists of the chain of
triplets $(N\,C_\alpha\,C')_1$, $(N\,C_\alpha\,C')_2$, $(N\,C_\alpha\,C')_3$, $\ldots$,
$(N\,C_\alpha\,C')_n$. Apart from the first nitrogen $N_1$
and the last carbonyl carbon $C'_n$, this backbone
(and its oxygen and amide hydrogen atoms) consists
of a chain of peptide planes, $C_{\alpha 1} \ldots C_{\alpha 2} \ldots
C_{\alpha\,n-1} \ldots C_{\alpha n}$. Since the angles among the four
bonds of the $C_\alpha$'s are fixed, the shape of the backbone
of peptide planes is determined by the angles
of rotation about the single bonds that link each
$C_\alpha$ to the N that precedes it and the C' that follows it.
The angle about the $N_i$-$C_{\alpha i}$ bond is called $\phi_i$;
that about the $C_{\alpha i}$-$C'_i$ bond is $\psi_i$. The $2n$ angles
$(\phi_1, \psi_1) \ldots (\phi_n, \psi_n)$ determine the shape of
the backbone of the protein. These angles are the
main kinematic variables of a protein. The principal
properties of proteins are discussed in the
classic article by Jane Richardson [1].
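
Given atomic coordinates, each backbone angle can be measured as a dihedral angle between four consecutive backbone atoms. The sketch below assumes the usual convention in which $\phi_i$ is the dihedral of $C'_{i-1}$, $N_i$, $C_{\alpha i}$, $C'_i$ and $\psi_i$ is that of $N_i$, $C_{\alpha i}$, $C'_i$, $N_{i+1}$; the function names are illustrative and are not taken from our code.

```python
import math


def _sub(a, b):
    return [a[k] - b[k] for k in range(3)]


def _dot(a, b):
    return sum(a[k] * b[k] for k in range(3))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) about the p1-p2 bond."""
    b0 = _sub(p0, p1)                 # bond pointing back from p1 to p0
    b1 = _sub(p2, p1)                 # rotation axis, from p1 to p2
    b2 = _sub(p3, p2)                 # bond pointing on from p2 to p3
    norm = math.sqrt(_dot(b1, b1))
    b1 = [c / norm for c in b1]
    # Project the outer bonds onto the plane perpendicular to the axis.
    v = _sub(b0, [_dot(b0, b1) * c for c in b1])
    w = _sub(b2, [_dot(b2, b1) * c for c in b1])
    return math.atan2(_dot(_cross(b1, v), w), _dot(v, w))


def phi(c_prev, n, c_alpha, c):
    """phi_i from C'(i-1), N(i), C_alpha(i), C'(i)."""
    return dihedral(c_prev, n, c_alpha, c)


def psi(n, c_alpha, c, n_next):
    """psi_i from N(i), C_alpha(i), C'(i), N(i+1)."""
    return dihedral(n, c_alpha, c, n_next)
```

A fully extended (trans) arrangement of the four atoms gives an angle of magnitude $\pi$, which is the value used for the denatured starting configuration.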

2. PROTEIN FOLDING 

The problem of protein folding is to predict the
natural folded shape of a protein under physiological
conditions from the DNA that defines its sequence
of amino acids, which is its primary structure.
This difficult problem has been approached
by several techniques. Some scientists have applied
all-atom molecular dynamics [2]. We have
used the Monte Carlo method in a manner inspired
by the work of Ken Dill et al. [3].

Our Monte Carlo simulations are guided by 
a simple potential with three terms. The first 
term embodies the Pauli exclusion principle. Be- 
cause the outer parts of atoms are electrons which 
are fermions, the Pauli exclusion principle re- 
quires that the side chains of a protein not over- 
lap by more than a fraction of an angstrom. In 
our present simulations, we have represented each 
side chain as a sphere centered at the first carbon 
atom, the $C_\beta$, of the side chain or at the hydrogen
atom that is the side chain in the case of glycine. 
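
An excluded-volume term of this kind can be sketched as a penalty on pairs of side-chain spheres that interpenetrate by more than a small tolerance. The functional form (a quadratic penalty), the tolerance, the strength, and the radii in the example are all assumptions made for illustration; they are not the values or the form used in our code.

```python
import math


def overlap_penalty(centers, radii, tolerance=0.5, strength=1.0):
    """Sum a penalty over pairs of spheres that overlap by more than
    `tolerance` (same length unit as the coordinates, e.g. angstroms).

    centers : list of (x, y, z) sphere centers, one per side chain
    radii   : list of sphere radii, one per side chain (hypothetical values)
    """
    energy = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            distance = math.dist(centers[i], centers[j])
            overlap = radii[i] + radii[j] - distance
            if overlap > tolerance:
                # Assumed quadratic growth beyond the allowed overlap.
                energy += strength * (overlap - tolerance) ** 2
    return energy
```

For two spheres of radius 2 whose centers are 3 apart, the overlap is 1, which exceeds the tolerance by 0.5 and contributes 0.25 to the penalty.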

The second term represents the mutual attrac- 
tion of nonpolar or hydrophobic amino acids. In 
effect the water electric dipoles, the free protons, 
the free hydroxyl radicals, and the other ions of 
the cellular fluid attract the charged and polar 
amino acids of a protein but leave unaffected the 
nonpolar amino acids. The resulting net inward 
force on the nonpolar amino acids drives them 
into a core which can be as densely packed as an 
ionic crystal. 

The third term is a very phenomenological rep- 
resentation of the effects of steric repulsion and 
hydrogen bonding. For a given amino acid, this 
term is more negative when its pair of angles $\phi_i$
and $\psi_i$ are in a zone that avoids steric clashes between
the backbone and the side chain and that
encourages the formation of hydrogen bonds between
NH$^+$ and O$^-$ groups. One of these Ramachandran
zones favors the formation of α helices;
others favor β structures.

We incorporate these zones in a Metropolis step
with two scales, which we call zoning with memory.
Each Monte-Carlo trial move begins with
a random number that determines whether the
angles $\phi_i$ and $\psi_i$ of residue $i$ will change zone,
e.g., from its present zone to the α zone, the β
zone, or the miscellaneous zone. If the zone
is changed, then the angles $\phi_i$ and $\psi_i$ revert to
the values they possessed when residue $i$ was last
in that zone. The trial move is then modified
slightly and randomly.
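
The two-scale move can be sketched as follows. This is a simplified illustration under stated assumptions, not our production code: the zone centers, the probability of a zone change, the size of the small random modification, and all names are hypothetical, zone boundaries are not enforced after the small modification, and `energy` stands for any function that returns the potential for the current state of the residue.

```python
import math
import random

# Illustrative starting angles (phi, psi) in radians for each zone; assumed values.
ZONE_CENTERS = {
    "alpha": (math.radians(-60.0), math.radians(-45.0)),
    "beta": (math.radians(-120.0), math.radians(130.0)),
    "misc": (math.radians(60.0), math.radians(45.0)),
}


def wrap(angle):
    """Map an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


class Residue:
    def __init__(self, phi, psi, zone):
        self.phi, self.psi, self.zone = phi, psi, zone
        # Memory: the angles this residue last held in each zone.
        self.memory = dict(ZONE_CENTERS)
        self.memory[zone] = (phi, psi)


def propose(residue, rng, p_zone_change=0.1, jitter=0.05):
    """Return a trial (zone, phi, psi): coarse zone jump, then fine change."""
    zone, phi, psi = residue.zone, residue.phi, residue.psi
    if rng.random() < p_zone_change:
        zone = rng.choice([z for z in ZONE_CENTERS if z != residue.zone])
        phi, psi = residue.memory[zone]      # revert to the remembered angles
    phi = wrap(phi + rng.gauss(0.0, jitter))
    psi = wrap(psi + rng.gauss(0.0, jitter))
    return zone, phi, psi


def metropolis_step(residue, energy, rng, temperature=1.0, **move_options):
    """Apply one trial move to the residue; return True if it is accepted."""
    old = (residue.zone, residue.phi, residue.psi)
    e_old = energy(residue)
    residue.memory[residue.zone] = (residue.phi, residue.psi)
    residue.zone, residue.phi, residue.psi = propose(residue, rng, **move_options)
    delta = energy(residue) - e_old
    if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
        residue.memory[residue.zone] = (residue.phi, residue.psi)
        return True
    residue.zone, residue.phi, residue.psi = old   # rejected: restore the state
    return False
```

A call such as `metropolis_step(residue, energy, random.Random(0))` performs one trial; the memory lets a residue that returns to a zone resume from angles that were previously accepted there rather than from a generic starting point.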

2.1. Rotations 

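We have derived a simple formula for the $3 \times 3$
real orthogonal matrix that represents a right-handed
rotation by $\theta = |\vec\theta|$ radians about the axis
$\hat\theta = \vec\theta/\theta$:

$$ e^{-i \vec\theta \cdot \vec J} = \cos\theta \, I - i \, \hat\theta \cdot \vec J \, \sin\theta + (1 - \cos\theta) \, \hat\theta \, \hat\theta^T , $$

in which the generators $(J_k)_{ij} = i \epsilon_{ikj}$ satisfy
$[J_i, J_j] = i \epsilon_{ijk} J_k$ and $T$ means transpose. In
terms of indices, this formula for $R(\vec\theta) = e^{-i \vec\theta \cdot \vec J}$ is

$$ R(\vec\theta)_{ij} = \delta_{ij} \cos\theta - \sin\theta \, \epsilon_{ijk} \hat\theta_k + (1 - \cos\theta) \, \hat\theta_i \hat\theta_j . $$

In these formulae $\epsilon_{ijk}$ is totally antisymmetric
with $\epsilon_{123} = 1$, and sums over $k$ from 1 to 3 are
understood.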
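
The index form of the rotation formula, $R_{ij} = \delta_{ij} \cos\theta - \sin\theta \, \epsilon_{ijk} \hat\theta_k + (1 - \cos\theta) \hat\theta_i \hat\theta_j$, translates directly into code. The sketch below is a plain-Python transcription for checking purposes; the function names are illustrative, and indices run from 0 to 2 rather than from 1 to 3.

```python
import math


def levi_civita(i, j, k):
    """Totally antisymmetric symbol for indices 0, 1, 2 (epsilon_012 = 1)."""
    return (i - j) * (j - k) * (k - i) // 2


def rotation_matrix(theta_vec):
    """Right-handed rotation by |theta_vec| radians about theta_vec."""
    theta = math.sqrt(sum(t * t for t in theta_vec))
    if theta == 0.0:
        return [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    axis = [t / theta for t in theta_vec]
    c, s = math.cos(theta), math.sin(theta)
    return [[(c if i == j else 0.0)
             - s * sum(levi_civita(i, j, k) * axis[k] for k in range(3))
             + (1.0 - c) * axis[i] * axis[j]
             for j in range(3)]
            for i in range(3)]


def rotate(matrix, vector):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(matrix[i][j] * vector[j] for j in range(3)) for i in range(3)]
```

As a check of the handedness, a rotation by $\pi/2$ about the z axis carries the unit vector along x into the unit vector along y.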

2.2. Distance 

A conventional measure of the quality of a theoretical
fold is the mean root-mean-square distance $d$
between the positions $r(i)$ of the α carbons
of the folded protein and those $x(i)$ of the native
structure of the protein.

The native states of many proteins are available
from http://www.rcsb.org/pdb/ [4]. We have
derived a formula for this distance in terms of
the centers of mass $\bar r = (1/n) \sum_{i=1}^n r(i)$ and
$\bar x = (1/n) \sum_{i=1}^n x(i)$, the relative coordinates
$q(i) = r(i) - \bar r$ and $y(i) = x(i) - \bar x$, their inner
products $Q^2 = \sum_{i=1}^n q(i)^2$ and $Y^2 = \sum_{i=1}^n y(i)^2$,
and the matrix that is the sum of their outer products,
$B_{ik} = \sum_{j=1}^n q(j)_i \, y(j)_k$. If $B^T$
denotes the transpose of this $3 \times 3$
matrix $B$ and tr denotes the trace, then the rms
distance $d$ is

$$ d = \sqrt{ \frac{ Q^2 + Y^2 - 2 \, \mathrm{tr} \, (B B^T)^{1/2} }{ n } } . $$
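
A distance of this form can be evaluated without an explicit search over rotations. The sketch below assumes the expression $d^2 = (Q^2 + Y^2 - 2 \, \mathrm{tr} (B B^T)^{1/2})/n$, where $Q^2$ and $Y^2$ are the sums of the squared lengths of the centered coordinates and $B$ is the sum of their outer products. The trace of the matrix square root is the sum of the square roots of the eigenvalues of $B B^T$, which are obtained here from the standard closed form for a symmetric $3 \times 3$ matrix. The function names are illustrative. Because every eigenvalue enters with a positive sign, the expression corresponds to the best superposition over all orthogonal matrices, reflections included.

```python
import math


def centered(points):
    """Subtract the center of mass from a list of 3D points."""
    n = len(points)
    center = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - center[k] for k in range(3)] for p in points]


def symmetric_3x3_eigenvalues(a):
    """Eigenvalues of a real symmetric 3x3 matrix (trigonometric closed form)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:                       # already diagonal
        return [a[0][0], a[1][1], a[2][2]]
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))
    angle = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(angle)
    e3 = q + 2.0 * p * math.cos(angle + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def rms_distance(folded, native):
    """rms distance between two equally long lists of alpha-carbon positions."""
    n = len(folded)
    q, y = centered(folded), centered(native)
    q2 = sum(c * c for p in q for c in p)
    y2 = sum(c * c for p in y for c in p)
    b = [[sum(q[j][i] * y[j][k] for j in range(n)) for k in range(3)]
         for i in range(3)]
    bbt = [[sum(b[i][m] * b[k][m] for m in range(3)) for k in range(3)]
           for i in range(3)]
    trace_sqrt = sum(math.sqrt(max(ev, 0.0))
                     for ev in symmetric_3x3_eigenvalues(bbt))
    return math.sqrt(max((q2 + y2 - 2.0 * trace_sqrt) / n, 0.0))
```

Two copies of the same structure that differ only by a rigid rotation and translation give a distance of zero up to rounding error, which is a convenient test of an implementation.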



2.3. Two Proteins

We have performed simulations on a protein
fragment of 36 amino acids called the villin headpiece
(1VII). We begin by rotating the $2n$ dihedral
angles $\phi$ and $\psi$ of the protein to $\pi$, except for
the angle $\phi$ of proline. In this denatured starting
configuration, the average rms distance $d$ is 29 Å.
Our best simulations so far fold the villin headpiece
to a mean rms distance $d$ that is slightly less
than 5 Å from its native state.

Our second protein is a 56-residue fragment of
the 63-residue protein ColE1 rop (1ROP). From a
denatured configuration with $d = 55$ Å, our code
folds this protein to a mean rms distance $d$ of
slightly less than 3.2 Å from its native state.